Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis
1️⃣ One-sentence summary
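This study analyzes how speaker characteristics and phonological rules influence each other when synthesizing accented speech, and proposes a new method for measuring and improving the accent control and interpretability of speech synthesis systems.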
Many spoken languages, including English, exhibit wide variation in dialects and accents, making accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific accents. While effective, this approach offers limited interpretability and controllability, as embeddings also encode traits such as timbre and emotion. In this study, we analyze the interaction between speaker embeddings and linguistically motivated phonological rules in accented speech synthesis. Using American and British English as a case study, we implement rules for flapping, rhoticity, and vowel correspondences. We propose the phoneme shift rate (PSR), a novel metric quantifying how strongly embeddings preserve or override rule-based transformations. Experiments show that combining rules with embeddings yields more authentic accents, while embeddings can attenuate or overwrite rules, revealing entanglement between accent and speaker identity. Our findings highlight rules as a lever for accent control and a framework for evaluating disentanglement in speech generation.
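The abstract names the phoneme shift rate (PSR) but does not give its formula, so the following is only a minimal sketch under stated assumptions. It assumes PSR is the share of rule-targeted phoneme positions where the phonemes recognized in the synthesized audio still match the rule output, and that the three phoneme sequences are already aligned position by position. The function names `apply_flapping` and `phoneme_shift_rate`, the vowel inventory, and the simplified flapping rule (any /t/ between two vowels, ignoring stress) are all hypothetical illustrations, not the authors' implementation.

```python
from typing import List, Sequence

# Assumption: a tiny illustrative vowel inventory, not a full phoneme set.
VOWELS = {"ɔ", "ə", "ɪ", "i", "æ", "ɑ", "ɛ", "u", "ʊ", "ʌ"}


def apply_flapping(phonemes: Sequence[str]) -> List[str]:
    """Simplified American English flapping: /t/ between two vowels becomes [ɾ].

    Assumption: stress and word boundaries are ignored for brevity.
    """
    out = list(phonemes)
    for i in range(1, len(phonemes) - 1):
        if (
            phonemes[i] == "t"
            and phonemes[i - 1] in VOWELS
            and phonemes[i + 1] in VOWELS
        ):
            out[i] = "ɾ"
    return out


def phoneme_shift_rate(
    source: Sequence[str],
    rule_output: Sequence[str],
    observed: Sequence[str],
) -> float:
    """Assumed PSR: fraction of rule-changed positions preserved in the output.

    source      -- phonemes before any rule is applied
    rule_output -- phonemes after the rule is applied
    observed    -- phonemes recognized in the synthesized speech
    All three are assumed to be aligned one-to-one.
    """
    if not (len(source) == len(rule_output) == len(observed)):
        raise ValueError("sequences must be aligned and of equal length")
    # Positions the rule actually changed.
    targeted = [i for i, (s, r) in enumerate(zip(source, rule_output)) if s != r]
    if not targeted:
        raise ValueError("the rule changed no positions; PSR is undefined")
    preserved = sum(observed[i] == rule_output[i] for i in targeted)
    return preserved / len(targeted)


# "water" followed by "city"; both contain an intervocalic /t/.
source = ["w", "ɔ", "t", "ə", "r", "s", "ɪ", "t", "i"]
rule_output = apply_flapping(source)
# Hypothetical recognizer output: the flap survives in "water" but not in "city".
observed = ["w", "ɔ", "ɾ", "ə", "r", "s", "ɪ", "t", "i"]
print(phoneme_shift_rate(source, rule_output, observed))  # 0.5
```

Under this assumed definition, a value near 1 would mean the speaker embedding leaves the rule-based transformation intact, and a value near 0 would mean the embedding overrides it. In practice the `observed` sequence would come from a phoneme recognizer or forced aligner, and the alignment step is omitted here.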
Source: arXiv: 2601.14417