arXiv submission date: 2026-02-16
📄 Abstract - SA-SSL-MOS: Self-supervised Learning MOS Prediction with Spectral Augmentation for Generalized Multi-Rate Speech Assessment

Designing a speech quality assessment (SQA) system for estimating mean-opinion-score (MOS) of multi-rate speech with varying sampling frequency (16-48 kHz) is a challenging task. The challenge arises due to the limited availability of a MOS-labeled training dataset comprising multi-rate speech samples. While self-supervised learning (SSL) models have been widely adopted in SQA to boost performance, a key limitation is that they are pretrained on 16 kHz speech and therefore discard high-frequency information present in higher sampling rates. To address this issue, we propose a spectrogram-augmented SSL method that incorporates high-frequency features (up to 48 kHz sampling rate) through a parallel-branch architecture. We further introduce a two-step training scheme: the model is first pre-trained on a large 48 kHz dataset and then fine-tuned on a smaller multi-rate dataset. Experimental results show that leveraging high-frequency information overlooked by SSL features is crucial for accurate multi-rate SQA, and that the proposed two-step training substantially improves generalization when multi-rate data is limited.
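The limitation described in the abstract follows from the Nyquist limit: a signal sampled at rate fs can only represent frequencies up to fs/2, so a model that takes 16 kHz input sees nothing above 8 kHz. The sketch below works through that arithmetic for the 16-48 kHz range the abstract mentions. It is an illustration only, not the authors' code. The function names and the list of example rates are assumptions chosen for this example.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2


def band_missed_by_ssl(sample_rate_hz: float, ssl_rate_hz: float = 16_000):
    """Return the (low, high) band in Hz that is lost when speech sampled at
    `sample_rate_hz` is reduced to `ssl_rate_hz` for a 16 kHz SSL model.

    Returns None when nothing is lost (the input rate does not exceed the
    SSL model's rate). Hypothetical helper, for illustration only.
    """
    if sample_rate_hz <= ssl_rate_hz:
        return None
    return (nyquist_hz(ssl_rate_hz), nyquist_hz(sample_rate_hz))


if __name__ == "__main__":
    # Example rates within the 16-48 kHz range (assumed, not from the paper).
    for rate in (16_000, 24_000, 32_000, 48_000):
        missed = band_missed_by_ssl(rate)
        if missed is None:
            print(f"{rate} Hz: fully covered by a 16 kHz SSL model")
        else:
            low, high = missed
            print(f"{rate} Hz: band {low:.0f}-{high:.0f} Hz is not seen by the SSL model")
```

For 48 kHz speech the band missed is 8-24 kHz, which is the high-frequency content the paper's parallel spectrogram branch is meant to supply.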

Top-level tags: audio, machine learning, model training
Detailed tags: speech quality assessment, self-supervised learning, multi-rate speech, spectral augmentation, mean opinion score

SA-SSL-MOS: Self-supervised Learning MOS Prediction with Spectral Augmentation for Generalized Multi-Rate Speech Assessment


1️⃣ One-sentence summary

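This paper proposes a self-supervised learning method augmented with high-frequency information which, through a parallel-branch architecture and a two-step training strategy, addresses the inability of existing models to accurately assess speech quality across sampling rates (16-48 kHz) because their training data is limited to 16 kHz, and substantially improves the generalization of multi-rate speech quality assessment.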

Source: arXiv: 2602.14785