arXiv submission date: 2026-05-20
📄 Abstract - Rethinking Cross-Layer Information Routing in Diffusion Transformers

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design -- tokenization, attention, conditioning, objectives, and latent autoencoders -- has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed DAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.
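The abstract does not spell out how DAR computes its aggregation, so the following is only a toy sketch of what "learnable, timestep-adaptive, non-incremental aggregation over the history of sublayer outputs" could look like, contrasted with plain residual addition. Everything specific here is an assumption made for illustration: the function names, the per-entry `base_logits` and `time_slopes` parameters, the linear dependence on a timestep `t` in [0, 1], and the softmax normalization are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def residual_aggregate(history):
    """Standard residual stream: the plain sum of all earlier sublayer outputs."""
    dim = len(history[0])
    return [sum(h[d] for h in history) for d in range(dim)]

def routed_aggregate(history, base_logits, time_slopes, t):
    """Hypothetical routing (NOT the paper's formulation): a softmax-weighted
    mix of the history whose weights depend on the denoising timestep t."""
    # Assumed parameterization: one learnable logit and one slope per history entry.
    logits = [b + s * t for b, s in zip(base_logits, time_slopes)]
    weights = softmax(logits)
    dim = len(history[0])
    mixed = [sum(w * h[d] for w, h in zip(weights, history)) for d in range(dim)]
    return mixed, weights

# Three toy sublayer outputs, each a 2-dimensional vector.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
base_logits = [0.0, 0.0, 0.0]   # stand-ins for learned parameters
time_slopes = [-2.0, 0.0, 2.0]  # stand-ins for learned parameters

res = residual_aggregate(history)
early, w_early = routed_aggregate(history, base_logits, time_slopes, t=0.0)
late, w_late = routed_aggregate(history, base_logits, time_slopes, t=1.0)
```

In this sketch the routing weights sum to one, so the mixed vector stays within the range spanned by the history, whereas the residual sum grows with every added entry. That is a property of this toy parameterization only; whether DAR addresses magnitude inflation in the same way is not stated in the abstract.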

Top-level tags: computer vision, model training
Detailed tags: diffusion transformers, information routing, residual stream, image generation, training acceleration

Rethinking Cross-Layer Information Routing in Diffusion Transformers


1️⃣ One-Sentence Summary

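This paper systematically analyzes how information flows across layers in Diffusion Transformers, finds that conventional residual addition causes three symptoms (forward magnitude inflation, backward gradient decay, and block-wise redundancy), and proposes an adaptive routing mechanism (DAR) that adjusts, per denoising timestep, how each layer's information is aggregated, improving both generation quality (7.56 vs. 9.67 FID for SiT-XL/2 on ImageNet 256×256) and training efficiency (baseline quality reached with 8.75× fewer iterations).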

Source: arXiv: 2605.20708