arXiv submission date: 2026-07-02
📄 Abstract - InterCMDM: Block-Causal Diffusion for Autoregressive Human Interaction Generation

Text-conditioned human interaction generation must capture both long-range temporal causality within each individual and tightly coupled coordination between partners. Existing interaction diffusion models typically denoise full sequences using bidirectional attention, which obscures causality and hinders streaming and long-horizon generation. Autoregressive alternatives enforce causality but often suffer from temporal drift, leading to coordination degradation and unstable interaction dynamics over time. We propose InterCMDM, a block-causal latent diffusion framework for autoregressive two-person interaction generation. InterCMDM introduces a Dual-Stream Causal Diffusion Transformer that maintains separate causal streams for each person while modeling inter-person dependencies via unified dual-stream attention with multi-task attention masks. These masks unify interaction modeling within a single attention mechanism and support diverse coordination behaviors, including simultaneous actions, reactive responses, leader-follower dynamics, and independent motion. By training a single model across these mask configurations as a form of data augmentation, InterCMDM enables controllable interaction generation by simply selecting the desired attention mask at inference time. Finally, a block-wise diffusion objective enables stable latent rollout over long sequences without repeated decode-encode cycles. InterCMDM achieves state-of-the-art performance on InterHuman and Inter-X, improving text-motion alignment, realism, and long-horizon continuity.
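The abstract does not spell out how the multi-task attention masks are built, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes tokens are laid out as person A's blocks followed by person B's blocks, that attention is bidirectional inside a block and causal across blocks, and that each coordination behavior differs only in which partner blocks a stream may see. The function names, the mode strings, and the exact cross-stream rule chosen for each mode are all hypothetical.

```python
import numpy as np


def block_causal(num_blocks, block_size, strict=False):
    """Boolean (T, T) mask; True means the query (row) may attend to the key (column).

    Tokens attend to every token in their own block and in earlier blocks.
    With strict=True, only strictly earlier blocks are visible.
    """
    blk = np.arange(num_blocks * block_size) // block_size  # block index per token
    if strict:
        return blk[:, None] > blk[None, :]
    return blk[:, None] >= blk[None, :]


def dual_stream_mask(num_blocks, block_size, mode):
    """Joint (2T, 2T) mask over the layout [person A tokens, person B tokens].

    The per-mode rules below are illustrative assumptions, not taken from the paper:
      simultaneous    - each person sees the partner up to the current block
      reactive        - each person sees only the partner's strictly past blocks
      leader_follower - B (follower) sees A (leader); A never sees B
      independent     - no cross-person attention
    """
    T = num_blocks * block_size
    same = block_causal(num_blocks, block_size)
    past = block_causal(num_blocks, block_size, strict=True)
    none = np.zeros((T, T), dtype=bool)

    # (A queries -> B keys, B queries -> A keys)
    cross = {
        "simultaneous": (same, same),
        "reactive": (past, past),
        "leader_follower": (none, same),
        "independent": (none, none),
    }
    if mode not in cross:
        raise ValueError(f"unknown mode: {mode!r}")
    a_sees_b, b_sees_a = cross[mode]

    # Each person's own stream is always block-causal.
    return np.block([[same, a_sees_b], [b_sees_a, same]])


if __name__ == "__main__":
    mask = dual_stream_mask(num_blocks=3, block_size=2, mode="leader_follower")
    print(mask.astype(int))
```

Under these assumptions, switching behavior at inference time amounts to passing a different `mode`, which mirrors the abstract's claim that control comes from selecting the attention mask rather than retraining.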

Top-level tags: multi-modal, model training, video generation
Detailed tags: human interaction generation, autoregressive diffusion, causal attention, dual-stream transformer, text-to-motion

InterCMDM: Block-Causal Diffusion for Autoregressive Human Interaction Generation


1️⃣ One-sentence summary

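This paper proposes InterCMDM, a model that combines block-causal diffusion with a dual-stream attention mechanism to generate two-character interactions that are natural, coherent, and coordinated over long horizons, addressing the inability of existing models to achieve both causal temporal ordering and long-range stability.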

Source: arXiv: 2607.01743