UniTAF: A Modular Framework for Joint Text-to-Speech and Audio-to-Face Modeling
1️⃣ One-sentence summary
This paper proposes a modular framework called UniTAF, which merges independent text-to-speech and audio-to-face models into a single unified model. By sharing internal features, it improves the consistency between the speech and facial expressions generated from text, and it validates the feasibility of this joint modeling from a system-design perspective.
This work merges two independent models, text-to-speech (TTS) and audio-to-face (A2F), into a unified model that enables internal feature transfer, thereby improving the consistency between the audio and facial expressions generated from text. We also discuss extending the emotion control mechanism from TTS to the joint model. This work does not aim to showcase generation quality; instead, from a system-design perspective, it validates the feasibility of reusing intermediate TTS representations for joint modeling of speech and facial expressions, and provides an engineering reference for subsequent speech-and-expression co-design. The project code has been open-sourced at: this https URL
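To make the feature-reuse idea concrete, below is a minimal PyTorch sketch of the design pattern the abstract describes: a shared text encoder produces intermediate representations that feed both a speech head and a face head, with a single emotion embedding conditioning both outputs. This is not the UniTAF implementation; all module names, dimensions, the blendshape output, and the emotion-conditioning scheme are illustrative assumptions, and a real system would add duration modeling to map token-rate features to audio and face frame rates.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of joint TTS + A2F modeling via a shared encoder.
# Not the authors' architecture; sizes and conditioning are assumptions.

class JointTTSA2F(nn.Module):
    def __init__(self, vocab_size=256, d_model=256,
                 n_mels=80, n_blendshapes=52, n_emotions=8):
        super().__init__()
        # Shared text encoder: its hidden states are the intermediate
        # representation reused by both output heads.
        self.text_encoder = nn.Sequential(
            nn.Embedding(vocab_size, d_model),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
                num_layers=2,
            ),
        )
        # One emotion embedding conditions the shared features, so a single
        # control signal steers both speech and face (the "extension" above).
        self.emotion_emb = nn.Embedding(n_emotions, d_model)
        # Two lightweight heads: acoustics (mel frames) and facial
        # animation (blendshape coefficients). Because they read the same
        # features, audio and face stay aligned by construction; a real
        # system would upsample token-rate features to frame rate first.
        self.speech_head = nn.Linear(d_model, n_mels)
        self.face_head = nn.Linear(d_model, n_blendshapes)

    def forward(self, tokens, emotion_id):
        h = self.text_encoder(tokens)                      # (B, T, d_model)
        h = h + self.emotion_emb(emotion_id).unsqueeze(1)  # broadcast over time
        return self.speech_head(h), self.face_head(h)

tokens = torch.randint(0, 256, (1, 32))
mel, blendshapes = JointTTSA2F()(tokens, torch.tensor([3]))
print(mel.shape, blendshapes.shape)  # (1, 32, 80) and (1, 32, 52)
```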
Source: arXiv: 2602.15651