WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation
1️⃣ One-Sentence Summary
This paper introduces WavSLM, a new speech language model that distills and quantizes self-supervised speech representations into a single codebook, allowing it to jointly model the semantic and acoustic content of speech from a single data stream without any text supervision. This simplifies the model architecture and enables streaming inference.
Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single-stream generative pretraining paradigm that has proven effective in text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self-supervised WavLM representations into a single codebook and optimizing an autoregressive next-chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters, less training data, and supporting streaming inference. Demo samples are available at this https URL.
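The core idea in the abstract is that continuous self-supervised features (here, WavLM outputs) are quantized against a single codebook, producing one discrete token per frame, so that a standard autoregressive objective can be applied to one token stream. Below is a minimal, hypothetical sketch of that quantization step and how next-token prediction pairs would be formed from the resulting stream; the codebook, feature dimensions, and random features are stand-ins, not the paper's actual configuration.

```python
import numpy as np

def quantize_to_codebook(features, codebook):
    """Map each continuous feature frame to the index of its nearest
    codebook entry (single-codebook quantization). In WavSLM the features
    would be distilled WavLM representations; here they are random
    placeholders for illustration."""
    # Squared Euclidean distance from every frame to every code vector,
    # via broadcasting: (T, 1, D) - (1, K, D) -> (T, K, D) -> (T, K).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # One discrete token per frame -> a single token stream.
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))   # 16 codes, 8-dim (hypothetical sizes)
frames = rng.normal(size=(50, 8))     # 50 feature frames (stand-in for WavLM outputs)
tokens = quantize_to_codebook(frames, codebook)

# Autoregressive setup: the model is trained to predict the next token(s)
# from the prefix (the paper uses next-chunk rather than next-token prediction,
# but the shifted-pair construction is analogous).
inputs, targets = tokens[:-1], tokens[1:]
```

Because every frame maps to exactly one index, the model sees the same kind of flat token sequence a text LM sees, which is what lets the single-stream pretraining recipe carry over without hierarchical token streams.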
Source: arXiv: 2603.05299