arXiv submission date: 2025-12-08
📄 Abstract - ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation

Text-to-video (T2V) generation has advanced rapidly, yet maintaining consistent character identities across scenes remains a major challenge. Existing personalization methods often focus on facial identity but fail to preserve broader contextual cues such as hairstyle, outfit, and body shape, which are critical for visual coherence. We propose ContextAnyone, a context-aware diffusion framework that achieves character-consistent video generation from text and a single reference image. Our method jointly reconstructs the reference image and generates new video frames, enabling the model to fully perceive and utilize reference information. Reference information is effectively integrated into a DiT-based diffusion backbone through a novel Emphasize-Attention module that selectively reinforces reference-aware features and prevents identity drift across frames. A dual-guidance loss combines diffusion and reference reconstruction objectives to enhance appearance fidelity, while the proposed Gap-RoPE positional embedding separates reference and video tokens to stabilize temporal modeling. Experiments demonstrate that ContextAnyone outperforms existing reference-to-video methods in identity consistency and visual quality, generating coherent and context-preserving character videos across diverse motions and scenes. Project page: this https URL.
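The abstract names two concrete training-time components: a dual-guidance loss that adds a reference-reconstruction term to the usual diffusion objective, and Gap-RoPE, which separates reference tokens from video tokens in positional-index space. The snippet below is a minimal PyTorch sketch of how such pieces could look; the function names, the `lambda_ref` weight, the `gap` offset of 64, and all tensor shapes are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def dual_guidance_loss(pred_video_noise, target_video_noise,
                       pred_ref, ref_image, lambda_ref=0.5):
    """Hypothetical combination of the two objectives named in the abstract."""
    # Standard diffusion (noise-prediction) objective on the video tokens.
    diffusion_loss = F.mse_loss(pred_video_noise, target_video_noise)
    # Reconstruction objective on the jointly denoised reference image.
    ref_loss = F.mse_loss(pred_ref, ref_image)
    # lambda_ref is an assumed weighting; the paper's value is not given here.
    return diffusion_loss + lambda_ref * ref_loss


def gap_rope_positions(num_ref_tokens, num_video_tokens, gap=64):
    """Assign 1-D position indices with a fixed gap between reference tokens
    and video tokens before applying rotary embeddings (gap size assumed)."""
    ref_pos = torch.arange(num_ref_tokens)
    video_pos = torch.arange(num_video_tokens) + num_ref_tokens + gap
    return torch.cat([ref_pos, video_pos], dim=0)
```

Under this reading, the gap simply keeps the reference tokens' rotary phases away from those of the earliest video tokens, so attention between the two groups does not treat the reference as if it were an adjacent frame.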

Top-level tags: video generation, AIGC, computer vision
Detailed tags: text-to-video, diffusion models, character consistency, reference-based generation, context-aware

ContextAnyone: Context-Aware Diffusion for Character-Consistent Text-to-Video Generation


1️⃣ One-Sentence Summary

This paper proposes a new method called ContextAnyone that, given only a single reference image and a text description, generates videos in which the character's appearance (including hairstyle, clothing, body shape, and other attributes) stays highly consistent while the motion remains natural, addressing the problem that character appearance tends to drift in videos produced by existing methods.


Source: arXiv: 2512.07328