arXiv submission date: 2026-06-23
📄 Abstract - Data Scale, Not Latency, Shapes Cross-Lingual Encoder Transfer in Streaming ASR

Adapting a streaming speech recognition model to a new language requires choosing between two plausible warm starts: a multilingual (ML) encoder or an English-only (EN) encoder. The common intuition is that the multilingual encoder should help most at low data, but it is unclear how long that advantage persists, whether tight streaming latency amplifies it, and whether it survives deployment quantization. We answer these questions with a controlled sweep of a 0.6 B-parameter cache-aware FastConformer transducer across eight European languages, up to five target-language data scales (100 h to 2500 h), three streaming tiers plus offline decoding, and up to four public test sets. The main result is that multilingual initialization is a data-limited advantage, not a latency-limited one. On FLEURS at 160 ms, the mean EN-ML word error rate (WER) gap falls from +4.21 percentage points (pp) at 100 h to +0.20 pp at 2500 h; a power-law fit summarizes this decay, with each doubling of target-language data roughly halving the remaining advantage. Across the three streaming tiers, the across-language mean EN-ML gap is approximately stable at each scale from 100 to 1000 h, and is near zero by 2500 h. Finally, 4-bit weight-only encoder quantization at the matched 560 ms streaming tier reduces the encoder footprint by about 3x, with an average FLEURS WER increase of about 0.5 pp. The resulting guideline is simple: use multilingual initialization in low-data regimes, treat the choice as effectively irrelevant at large data, and make latency and quantization decisions independently.
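The decay claim in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below assumes a pure power law, gap = a * hours^(-b), pinned only to the two FLEURS endpoints quoted above (+4.21 pp at 100 h, +0.20 pp at 2500 h). It is not the authors' fit, which presumably uses every data scale and language; the function names are hypothetical, and any value it produces between the endpoints is an interpolation, not a reported result.

```python
import math


def two_point_power_law(x0, y0, x1, y1):
    """Return (a, b) such that y = a * x**(-b) passes through both points."""
    b = math.log(y0 / y1) / math.log(x1 / x0)
    a = y0 * x0 ** b
    return a, b


# Endpoints quoted in the abstract (FLEURS, 160 ms tier): hours, EN-ML gap in pp.
a, b = two_point_power_law(100.0, 4.21, 2500.0, 0.20)

# Fraction of the gap that remains after each doubling of target-language data.
remaining_per_doubling = 2.0 ** (-b)


def gap_at(hours):
    """Illustrative EN-ML WER gap (pp) under the two-point power law."""
    return a * hours ** (-b)


if __name__ == "__main__":
    print(f"exponent b = {b:.3f}")
    print(f"gap remaining per doubling = {remaining_per_doubling:.3f}")
```

Under this assumption the exponent comes out at about 0.95, so each doubling leaves about 52% of the gap, which agrees with the abstract's "roughly halving" description.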

Top-level tags: audio, model training, machine learning
Detailed tags: streaming ASR, cross-lingual transfer, encoder initialization, quantization, data scaling

Data Scale, Not Latency, Shapes Cross-Lingual Encoder Transfer in Streaming ASR


1️⃣ One-sentence summary

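Through large-scale experiments, this paper finds that when a streaming speech recognition model is adapted to a new language, the advantage of a multilingual pretrained encoder depends mainly on the amount of target-language data (pronounced when data is scarce, gone when data is plentiful) and not on the streaming latency requirement; model quantization also has little effect on the results.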

Source: arXiv: 2606.24169