arXiv submission date: 2026-04-16
📄 Abstract - Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference

Deploying high-quality automatic speech recognition (ASR) on edge devices requires models that jointly optimize accuracy, latency, and memory footprint while operating entirely on CPU without GPU acceleration. We conduct a systematic empirical study of state-of-the-art ASR architectures, encompassing encoder-decoder, transducer, and LLM-based paradigms, evaluated across batch, chunked, and streaming inference modes. Through a comprehensive benchmark of over 50 configurations spanning OpenAI Whisper, NVIDIA Nemotron, Parakeet TDT, Canary, Conformer Transducer, and Qwen3-ASR, we identify NVIDIA's Nemotron Speech Streaming as the strongest candidate for real-time English streaming on resource-constrained hardware. We then re-implement the complete streaming inference pipeline in ONNX Runtime and conduct a controlled evaluation of multiple post-training quantization strategies, including importance-weighted k-quant, mixed-precision schemes, and round-to-nearest quantization, combined with graph-level operator fusion. These optimizations reduce the model from 2.47 GB to as little as 0.67 GB while maintaining word error rate (WER) within 1% absolute of the full-precision PyTorch baseline. Our recommended configuration, the int4 k-quant variant, achieves 8.20% average streaming WER across eight standard benchmarks, running comfortably faster than real-time on CPU with 0.56 s algorithmic latency, establishing a new quality-efficiency Pareto point for on-device streaming ASR.
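To make the quantization step concrete, the following is a minimal sketch of round-to-nearest (RTN) group quantization to int4, one of the post-training strategies the abstract compares. The group size, symmetric scaling, and numpy-only setup are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch of round-to-nearest (RTN) group quantization to signed int4.
# Group size and symmetric per-group scaling are assumptions for illustration,
# not the paper's exact scheme (which also covers k-quant and mixed precision).
import numpy as np

def quantize_rtn_int4(weights: np.ndarray, group_size: int = 32):
    """Quantize a 1-D weight vector to int4 values with per-group scales."""
    w = weights.reshape(-1, group_size)                  # split into groups
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0  # int4 range is [-8, 7]
    scales = np.where(scales == 0, 1.0, scales)          # avoid divide-by-zero
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct float weights from quantized values and scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=4096).astype(np.float32)
    q, s = quantize_rtn_int4(w)
    w_hat = dequantize(q, s)
    # Report the reconstruction error introduced by 4-bit rounding.
    print("mean abs error:", np.abs(w - w_hat).mean())
```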

Top-level tags: audio systems model evaluation
Detailed tags: automatic speech recognition streaming inference model compression on-device quantization

Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference


1️⃣ One-Sentence Summary

Through a systematic evaluation of mainstream speech recognition architectures and an optimized quantization and inference pipeline, this paper compresses a high-performance streaming ASR model by roughly 73% while keeping accuracy essentially unchanged, achieving faster-than-real-time, low-latency inference on CPU and establishing a new efficiency benchmark for resource-constrained on-device applications.
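To ground the efficiency claims in the summary, here is a back-of-the-envelope sketch of the two metrics involved: real-time factor (RTF) and algorithmic latency for chunked streaming. The chunk and lookahead durations in the example are illustrative assumptions chosen to add up to the 0.56 s figure quoted in the abstract; the actual chunk/lookahead split used by Nemotron Speech Streaming is not specified here.

```python
# Back-of-the-envelope sketch of real-time factor and algorithmic latency.
# All numeric values in the demo below are illustrative assumptions.
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means the system decodes faster than real time."""
    return processing_seconds / audio_seconds

def algorithmic_latency(chunk_seconds: float, lookahead_seconds: float) -> float:
    """Delay imposed by the streaming scheme itself, independent of compute:
    a frame cannot be emitted until its chunk (plus any right-context
    lookahead) has been received."""
    return chunk_seconds + lookahead_seconds

if __name__ == "__main__":
    # E.g. decoding 60 s of audio in 21 s of CPU time gives RTF = 0.35.
    print("RTF:", real_time_factor(21.0, 60.0))
    # E.g. a 0.48 s chunk with 0.08 s lookahead gives 0.56 s algorithmic latency.
    print("latency (s):", algorithmic_latency(0.48, 0.08))
```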

Source: arXiv 2604.14493