
arXiv submission date: 2025-12-10
📄 Abstract - Composing Concepts from Images and Videos via Concept-prompt Binding

Visual concept composition aims to integrate different elements from images and videos into a single, coherent visual output, yet existing methods still fall short in accurately extracting complex concepts from visual inputs and in flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training process of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
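To make the core idea concrete, here is a minimal, hypothetical sketch of concept-prompt binding as described in the abstract: a small "binder" network maps pooled visual features onto the embedding of a designated prompt token, while an extra absorbent token reserves a slot for concept-irrelevant details. All module names, tensor shapes, and the placeholder handling of the absorbent token are illustrative assumptions, not the paper's actual hierarchical, dual-branch implementation.

```python
# Toy illustration of concept-prompt binding (assumed design, not the paper's code).
import torch
import torch.nn as nn

class ToyBinder(nn.Module):
    """Projects pooled visual features into the prompt-token embedding space."""
    def __init__(self, vis_dim: int = 512, txt_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim, txt_dim), nn.GELU(), nn.Linear(txt_dim, txt_dim)
        )

    def forward(self, vis_feat: torch.Tensor) -> torch.Tensor:
        return self.proj(vis_feat)

def bind_tokens(prompt_emb, concept_slot, absorb_slot, binder, vis_feat):
    """Overwrite two reserved prompt positions: one bound to the visual concept,
    one 'absorbent' token meant to soak up concept-irrelevant details."""
    bound = prompt_emb.clone()
    delta = binder(vis_feat)                 # (B, txt_dim) concept embedding
    bound[:, concept_slot] = delta           # concept token <- visual concept
    # In the paper the absorbent token would be a learnable embedding trained
    # with diversified prompts; zeros here are only a placeholder.
    bound[:, absorb_slot] = torch.zeros_like(delta)
    return bound

# Usage: bind a pooled image feature into token position 3 of a 16-token prompt.
B, T, txt_dim, vis_dim = 2, 16, 768, 512
prompt_emb = torch.randn(B, T, txt_dim)      # frozen text-encoder output
vis_feat = torch.randn(B, vis_dim)           # pooled feature of the concept image
binder = ToyBinder(vis_dim, txt_dim)
conditioned = bind_tokens(prompt_emb, concept_slot=3, absorb_slot=4,
                          binder=binder, vis_feat=vis_feat)
print(conditioned.shape)                     # torch.Size([2, 16, 768])
```

The conditioned token sequence would then feed the cross-attention layers of a Diffusion Transformer, so the bound token carries the visual concept wherever it appears in a target prompt.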

Top-level tags: multi-modal, model training, computer vision
Detailed tags: visual concept composition, diffusion transformers, prompt binding, temporal disentanglement, one-shot learning

Composing Concepts from Images and Videos via Concept-prompt Binding


1️⃣ One-sentence summary

This paper proposes a method called "Bind & Compose" that accurately extracts distinct visual elements (such as objects, styles, and motions) from images and videos and flexibly combines them into a new, coherent visual output, substantially improving the diversity and quality of AI-driven visual creation.


Source: arXiv 2512.09824