UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling
1️⃣ One-sentence summary
This paper proposes a modular framework called UniTAF, which merges independent text-to-speech and audio-to-face models into a single unified model. By sharing internal features, it improves the consistency between the speech and facial expressions generated from text, and it validates the feasibility of this joint modeling from a system-design perspective.
This work merges two independent models, text-to-speech (TTS) and audio-to-face (A2F), into a unified model that enables internal feature transfer, thereby improving the consistency between the audio and facial expressions generated from text. We also discuss extending the emotion control mechanism from TTS to the joint model. This work does not aim to showcase generation quality; instead, from a system-design perspective, it validates the feasibility of reusing intermediate TTS representations for joint modeling of speech and facial expressions, and provides an engineering reference for subsequent speech-and-expression co-design. The project code has been open-sourced at: this https URL
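To make the feature-reuse idea concrete, below is a minimal PyTorch sketch of the design pattern the abstract describes: a shared text encoder produces intermediate representations that feed both a speech head and a face head, with a single emotion embedding conditioning both outputs. This is not the UniTAF implementation; all module names, dimensions, the blendshape output, and the emotion-conditioning scheme are illustrative assumptions, and a real system would add duration modeling to map token-rate features to audio and face frame rates.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of joint TTS + A2F modeling via a shared encoder.
# Not the authors' architecture; sizes and conditioning are assumptions.

class JointTTSA2F(nn.Module):
    def __init__(self, vocab_size=256, d_model=256,
                 n_mels=80, n_blendshapes=52, n_emotions=8):
        super().__init__()
        # Shared text encoder: its hidden states are the intermediate
        # representation reused by both output heads.
        self.text_encoder = nn.Sequential(
            nn.Embedding(vocab_size, d_model),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=2,
            ),
        )
        # One emotion embedding conditions the shared features, so a single
        # control signal steers both speech and face (the "extension" above).
        self.emotion_emb = nn.Embedding(n_emotions, d_model)
        # Two lightweight heads: acoustics (mel frames) and facial
        # animation (blendshape coefficients). Because they read the same
        # features, audio and face stay aligned by construction; a real
        # system would upsample token-rate features to frame rate first.
        self.speech_head = nn.Linear(d_model, n_mels)
        self.face_head = nn.Linear(d_model, n_blendshapes)

    def forward(self, tokens, emotion_id):
        h = self.text_encoder(tokens)                      # (B, T, d_model)
        h = h + self.emotion_emb(emotion_id).unsqueeze(1)  # broadcast over time
        return self.speech_head(h), self.face_head(h)

tokens = torch.randint(0, 256, (1, 32))
mel, blendshapes = JointTTSA2F()(tokens, torch.tensor([3]))
print(mel.shape, blendshapes.shape)  # (1, 32, 80) and (1, 32, 52)
```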
Source: arXiv: 2602.15651