Abstract - ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis
We investigate the use of zero-shot text-to-speech (ZS-TTS) as a data augmentation source for low-resource personalized speech synthesis. While synthetic augmentation can provide linguistically rich and phonetically diverse speech, naively mixing large amounts of synthetic speech with limited real recordings often leads to speaker similarity degradation during fine-tuning. To address this issue, we propose ZeSTA, a simple domain-conditioned training framework that distinguishes real and synthetic speech via a lightweight domain embedding, combined with real-data oversampling to stabilize adaptation under extremely limited target data, without modifying the base architecture. Experiments on LibriTTS and an in-house dataset with two ZS-TTS sources demonstrate that our approach improves speaker similarity over naive synthetic augmentation while preserving intelligibility and perceptual quality.
ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis
1️⃣ One-Sentence Summary
This work proposes a new method called ZeSTA, which attaches distinct "domain labels" to synthetic and real speech so the model can tell them apart; this makes it possible, even with extremely limited target data, to safely exploit large amounts of synthetic speech when training a high-quality personalized speech-synthesis model, improving the similarity of the synthesized voice to the target speaker while preserving intelligibility and naturalness.
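The two ingredients described above, a lightweight domain embedding that marks each utterance as real or synthetic, and oversampling of the scarce real recordings, can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the embedding table, feature shapes, oversampling factor, and function names are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8  # illustrative embedding width, not from the paper

# Lightweight domain embedding: one learnable vector per domain label.
# Row 0 = real speech, row 1 = synthetic (ZS-TTS) speech.
domain_table = rng.normal(size=(2, EMB_DIM))

def condition_on_domain(frame_feats: np.ndarray, domain_id: int) -> np.ndarray:
    """Add the domain embedding to every frame of the encoder input,
    so the model can separate real from synthetic speech without any
    change to the base architecture."""
    return frame_feats + domain_table[domain_id]

def oversampling_weights(is_real: np.ndarray, real_boost: float = 10.0) -> np.ndarray:
    """Per-utterance sampling weights that oversample the limited real
    data relative to the abundant synthetic data (boost factor is an
    assumed hyperparameter)."""
    w = np.where(is_real, real_boost, 1.0)
    return w / w.sum()

# Toy corpus: 3 real target-speaker utterances mixed with 97 synthetic ones.
is_real = np.array([True] * 3 + [False] * 97)
weights = oversampling_weights(is_real)

# Condition a batch of frame-level features as "real" (domain id 0).
feats = rng.normal(size=(50, EMB_DIM))      # 50 frames of encoder input
real_feats = condition_on_domain(feats, 0)
```

In a real training loop the weights would drive a weighted sampler over the mixed corpus, and the domain table would be trained jointly with the TTS model; at inference time the "real" embedding is used so the model generates in the real-speech domain.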