arXiv submission date: 2026-07-02
📄 Abstract - Using embeddings to predict spoken word duration and pitch in Mandarin monosyllabic words

Time-normalized f0 contours of Mandarin words in conversational speech have been shown to be predictable in part from their contextualized embeddings (CEs). The present study investigates whether CEs also predict spoken word duration for 7470 tokens of Mandarin monosyllabic CV words extracted from a Mandarin corpus of spontaneous speech. We show that CEs indeed are predictive for duration, above chance level, not only at the type level, but also at the level of individual tokens, as indicated by the results obtained with the type-wise and token-wise permutation baselines. We also show that the predicted durations are sufficiently precise to back-transform predicted f0 contours in [0,1] normalized time to contours on the ms time scale. The resulting predicted contours approximate empirical contours and also outperform a permutation baseline.
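The two operations named in the abstract, back-transforming a contour from normalized time to milliseconds and comparing predictions against a permutation baseline, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the use of mean absolute error, and all numbers are assumptions made for the example, and the abstract does not specify how the type-wise and token-wise permutations are constructed.

```python
import random


def back_transform(norm_times, f0_values, duration_ms):
    """Map an f0 contour sampled on normalized time [0, 1] onto milliseconds.

    Assumption: each normalized time point is simply scaled by the
    (predicted) word duration.
    """
    if duration_ms <= 0:
        raise ValueError("duration_ms must be positive")
    return [(t * duration_ms, f0) for t, f0 in zip(norm_times, f0_values)]


def mean_abs_error(predicted, observed):
    """Mean absolute error between paired predictions and observations."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def permutation_baseline(predicted, observed, n_permutations=1000, seed=0):
    """Average error after randomly re-pairing predictions with observations.

    Assumption: a generic token-level shuffle; the paper's type-wise and
    token-wise baselines may be defined differently.
    """
    rng = random.Random(seed)
    shuffled = list(predicted)
    errors = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        errors.append(mean_abs_error(shuffled, observed))
    return sum(errors) / len(errors)


# Hypothetical f0 contour (Hz) on normalized time, with a hypothetical
# predicted duration of 160 ms.
norm_times = [0.0, 0.25, 0.5, 0.75, 1.0]
f0_hz = [210.0, 205.0, 195.0, 188.0, 180.0]
contour_ms = back_transform(norm_times, f0_hz, duration_ms=160.0)

# Hypothetical observed and predicted durations (ms) for five tokens.
observed_ms = [120.0, 180.0, 150.0, 220.0, 90.0]
predicted_ms = [125.0, 170.0, 155.0, 210.0, 100.0]
model_error = mean_abs_error(predicted_ms, observed_ms)
baseline_error = permutation_baseline(predicted_ms, observed_ms)
```

In this sketch, a prediction counts as better than chance when `model_error` is smaller than `baseline_error`, which mirrors the comparison the abstract describes without reproducing its exact procedure.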

Top-level tags: natural language processing audio
Detailed tags: speech duration pitch prediction embeddings mandarin tones spoken language

Using embeddings to predict spoken word duration and pitch in Mandarin monosyllabic words


1️⃣ One-sentence summary

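Analyzing real speech data for 7470 tokens of Mandarin monosyllabic words, this study finds that contextualized embeddings (CEs) not only predict a word's pitch (as already shown in prior work) but also significantly predict its spoken duration, and that the predicted durations are precise enough to map normalized pitch contours back onto the actual millisecond time scale, providing a new quantitative tool for speech synthesis and understanding.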

Source: arXiv: 2607.02002