
arXiv submission date: 2026-03-29
📄 Abstract - Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers

Recent advances in diffusion-based controllable visual generation have led to remarkable improvements in image quality. However, these powerful models are typically deployed on cloud servers due to their large computational demands, raising serious concerns about user data privacy. To enable secure and efficient on-device generation, in this paper we explore controllable diffusion models built upon linear attention architectures, which offer superior scalability and efficiency, even on edge devices. Yet, our experiments reveal that existing controllable generation frameworks, such as ControlNet and OminiControl, either lack the flexibility to support multiple heterogeneous condition types or suffer from slow convergence on such linear-attention models. To address these limitations, we propose a novel controllable diffusion framework tailored for linear attention backbones like SANA. The core of our method is a unified gated conditioning module operating in a dual-path pipeline, which effectively integrates multiple types of conditional input, such as spatially aligned and non-aligned cues. Extensive experiments across multiple tasks and benchmarks demonstrate that our approach achieves state-of-the-art controllable generation performance among linear-attention models, surpassing existing methods in both fidelity and controllability.
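The abstract does not spell out how the gated conditioning module is implemented, but the general idea of gated condition injection is well established: condition features are added to the backbone's hidden states through a learned gate, often initialized at zero so training starts from the unconditional model. A minimal sketch of that idea (all names and shapes here are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

def gated_condition_injection(hidden, cond, gate):
    """Inject condition features into the backbone stream via a learned gate.

    hidden: (tokens, dim) backbone hidden states
    cond:   (tokens, dim) projected condition features
            (e.g. spatially aligned or non-aligned cues)
    gate:   scalar or per-channel gate; zero-initializing it means the
            injection starts as a no-op and the conditioning signal is
            blended in gradually during training
    """
    return hidden + gate * cond

# Toy usage: with gate = 0 the backbone output is unchanged,
# so the conditioned model starts from the pretrained behavior.
h = np.ones((4, 8))          # stand-in backbone features
c = np.full((4, 8), 0.5)     # stand-in condition features
out_closed = gated_condition_injection(h, c, 0.0)
out_open = gated_condition_injection(h, c, 1.0)
```

In a dual-path setup as described in the abstract, one could imagine separate condition branches (e.g. one for spatially aligned cues, one for non-aligned cues) each feeding such a gated injection point; the gates let the model learn how strongly to weight each condition type.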

Top-level tags: computer vision, model training, systems
Detailed tags: controllable generation, linear attention, diffusion models, on-device generation, gated conditioning

Gated Condition Injection without Multimodal Attention: Towards Controllable Linear-Attention Transformers


1️⃣ One-sentence summary

This paper proposes a new framework designed specifically for efficient linear-attention models. Through a unified gated conditioning module, it overcomes the limited flexibility and slow training that existing methods exhibit when integrating multiple control signals, enabling high-quality, controllable image generation on privacy-preserving edge devices.

Source: arXiv:2603.27666