IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation
1️⃣ One-sentence summary
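This paper proposes a fine-tuning-free method for talking face generation that produces video directly from pretrained Stable Diffusion and IP-Adapter, and designs three parameter-free components to address identity drift, lip desynchronization, and frame jitter, outperforming existing methods in lip-sync accuracy and visual quality.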
With the rapid advancement of diffusion models, talking face generation has made remarkable progress. However, existing diffusion-based methods still require task-specific fine-tuning and large-scale audiovisual datasets, resulting in high computational costs that hinder the scalability and accessibility of diffusion-based approaches across the research community. To address this, we propose a fine-tuning-free paradigm that directly performs talking face generation using the pretrained weights of Stable Diffusion and IP-Adapter. This backbone leverages the visual embedding capability of IP-Adapter to mine lip-related semantics from the pretrained Stable Diffusion. To address the challenges of identity drift, synchronization errors, and temporal instability, we also design three trainable-parameter-free components: (1) the Structurist, which explicitly disentangles and reassembles lip and appearance features to mitigate identity drift and appearance distortion; (2) the Structure Controller, which adaptively refines embeddings based on quasi-monotonic motion trends for precise lip synchronization; and (3) the Noise Sensor, which introduces a Gaussian prior to detect and suppress flicker and jitter artifacts and enhance temporal consistency. Experimental results show that our method outperforms existing SOTA approaches in both lip-sync accuracy (at least 0.16 gain in PCLD) and visual fidelity (at least 0.7 improvement in FID), establishing a novel fine-tuning-free diffusion framework for talking face generation.
Source: arXiv:2605.30230