📄 论文总结
FuseCodec:面向神经编解码器的语义-上下文融合与监督方法 / FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs
1️⃣ 一句话总结
这篇论文提出了一种名为FuseCodec的新型语音编码方法,通过融合声学、语义和上下文信息并进行多层次的监督学习,显著提升了语音处理的准确度、自然度和说话人相似性,并在零样本语音合成任务中验证了其有效性。
Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance in LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at this https URL.
FuseCodec:面向神经编解码器的语义-上下文融合与监督方法 / FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs
这篇论文提出了一种名为FuseCodec的新型语音编码方法,通过融合声学、语义和上下文信息并进行多层次的监督学习,显著提升了语音处理的准确度、自然度和说话人相似性,并在零样本语音合成任务中验证了其有效性。