arXiv submission date: 2026-01-13
📄 Abstract - End-to-End Video Character Replacement without Structural Guidance

Controllable video character replacement with a user-provided identity remains a challenging problem due to the lack of paired video data. Prior works have predominantly relied on a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (e.g., skeleton, depth). This reliance, however, severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, often leading to visual artifacts and temporal inconsistencies. In this paper, we propose MoCha, a pioneering framework that bypasses these limitations by requiring only a single mask on an arbitrary frame. To effectively adapt the multi-modal input conditions and enhance facial identity, we introduce a condition-aware RoPE and employ an RL-based post-training stage. Furthermore, to overcome the scarcity of qualified paired training data, we propose a comprehensive data construction pipeline. Specifically, we design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5), an expression-driven dataset synthesized with current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research. Please refer to our project page for more details: this http URL
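The abstract names a condition-aware RoPE for adapting the multi-modal input conditions but does not spell out its formulation. As a hedged illustration only, the sketch below shows one plausible variant of the idea: give each condition stream (e.g., video tokens vs. reference identity tokens) a disjoint rotary-position range so attention can tell the modalities apart. The function names, the `stream_gap` parameter, and the stream layout are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of a "condition-aware" rotary position embedding (RoPE).
# MoCha's exact formulation is not given in the abstract; we assume the simple
# variant of offsetting each condition stream into its own positional range.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Return rotation angles of shape (..., dim/2) for integer positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions.float().unsqueeze(-1) * inv_freq

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate channel pairs of x by the given angles (standard RoPE)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def condition_aware_positions(seq_lens: dict[str, int], stream_gap: int = 4096) -> torch.Tensor:
    """Assign each condition stream a disjoint positional range.

    seq_lens: ordered mapping of stream name -> token count,
              e.g. {"video": 1024, "identity": 77}.
    """
    chunks = []
    for i, (_, n) in enumerate(seq_lens.items()):
        # Offset every stream by a large gap so position ranges never overlap.
        chunks.append(torch.arange(n) + i * stream_gap)
    return torch.cat(chunks)

# Usage: rotate the queries/keys of the concatenated multi-modal sequence.
pos = condition_aware_positions({"video": 1024, "identity": 77})
q = torch.randn(1, pos.numel(), 64)            # (batch, tokens, head_dim)
q_rot = apply_rope(q, rope_angles(pos, dim=64))
```

The design choice sketched here, separating streams purely by positional offset, is just one way to make RoPE condition-aware; the paper may instead modulate frequencies or rotations per modality.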

Top-level tags: computer vision video aigc
Detailed tags: video editing character replacement conditional generation synthetic data post-training

End-to-End Video Character Replacement without Structural Guidance


1️⃣ One-sentence summary

This paper proposes a new method called MoCha, which needs only a single mask on an arbitrary frame to achieve high-quality, temporally coherent video character replacement in complex scenes, overcoming prior methods' reliance on cumbersome structural guidance and paired data.

Source: arXiv 2601.08587