Abstract - CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding
Neural audio codecs are a key component of speech processing pipelines, compressing audio into discrete tokens for downstream modeling. However, existing codecs struggle to balance reconstruction quality with token efficiency, often encoding perceptually irrelevant information such as background noise and recording artifacts at the expense of linguistically and acoustically meaningful content. We reframe audio tokenization as a selective information bottleneck problem and propose CleanCodec, a denoising audio codec that learns to encode only perceptually important features and to discard imperceptible information. At just 12.5 tokens per second, CleanCodec achieves state-of-the-art tokenization efficiency, substantially outperforming existing codecs in speaker similarity and speech intelligibility. Evaluations on downstream text-to-speech and voice conversion tasks further demonstrate improved performance and up to 17x faster inference, highlighting significant efficiency gains.
CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding
1️⃣ One-Sentence Summary
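This paper proposes CleanCodec, a new speech codec that acts like an intelligent filter, extracting only the information in speech that matters to the human ear (such as speaker characteristics and speech intelligibility) while automatically ignoring irrelevant information such as background noise, thereby achieving more efficient and more accurate speech reconstruction at a very low data rate and substantially speeding up downstream speech synthesis tasks.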
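
To put the 12.5 tokens-per-second figure from the abstract in concrete terms, the following minimal sketch computes how many tokens an utterance of a given length would occupy. The function name `token_count`, the rounding-up of partial frames, and the 50 tokens-per-second comparison rate are all illustrative assumptions, not details taken from the paper.

```python
import math


def token_count(duration_s: float, tokens_per_second: float = 12.5) -> int:
    """Estimate the token sequence length for an utterance.

    The default rate of 12.5 tokens/s is the figure quoted in the abstract.
    Rounding a partial final frame up to a whole token is an assumption.
    """
    return math.ceil(duration_s * tokens_per_second)


if __name__ == "__main__":
    duration = 10.0  # seconds of speech

    # Rate stated in the abstract.
    clean = token_count(duration)

    # Hypothetical comparison rate, chosen only for illustration.
    baseline = token_count(duration, tokens_per_second=50.0)

    print(f"{duration:.0f} s at 12.5 tokens/s: {clean} tokens")
    print(f"{duration:.0f} s at 50.0 tokens/s: {baseline} tokens")
    print(f"sequence length ratio: {baseline / clean:.1f}x")
```

Shorter token sequences mean fewer steps for a downstream model that generates or consumes tokens one at a time, which is one plausible reading of why a low token rate can translate into faster text-to-speech inference. The abstract does not break down where the reported speedup comes from.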