arXiv submission date: 2026-03-12
📄 Abstract - One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

Diffusion transformers (DiTs) achieve high generative quality but tie FLOPs to image resolution, preventing principled latency-quality trade-offs, and allocate computation uniformly across spatial tokens, wasting compute on unimportant regions. We introduce the Elastic Latent Interface Transformer (ELIT), a drop-in, DiT-compatible mechanism that decouples input image size from compute. Our approach inserts a latent interface, a learnable variable-length token sequence on which standard transformer blocks operate. Lightweight Read and Write cross-attention layers move information between spatial tokens and latents and prioritize important input regions. By randomly dropping tail latents during training, ELIT learns importance-ordered representations: earlier latents capture global structure while later ones refine details. At inference, the number of latents can be dynamically adjusted to match compute constraints. ELIT is deliberately minimal, adding only two cross-attention layers while leaving the rectified flow objective and the DiT stack unchanged. Across datasets and architectures (DiT, U-ViT, HDiT, MM-DiT), ELIT delivers consistent gains. On ImageNet-1K 512px, it achieves average improvements of $35.3\%$ in FID and $39.6\%$ in FDD. Project page: this https URL
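The Read/Write latent interface and tail-latent dropping can be sketched as below. This is a minimal single-head NumPy illustration under stated assumptions: the function names, random projection weights, and dimensions are hypothetical, not the paper's implementation, and the DiT blocks between Read and Write are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, Wq, Wk, Wv):
    # Single-head cross-attention: `queries` attend over `context`.
    q, k, v = queries @ Wq, context @ Wk, context @ Wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v

rng = np.random.default_rng(0)
d = 16            # model width (assumed)
n_spatial = 64    # spatial tokens, e.g. an 8x8 patch grid
n_latent = 8      # learnable latent tokens (variable-length in ELIT)

spatial = rng.standard_normal((n_spatial, d))
latents = rng.standard_normal((n_latent, d))   # learned parameters in practice
W = lambda: rng.standard_normal((d, d)) / np.sqrt(d)  # stand-in projections

# Read: latents gather information from the spatial tokens.
read_out = latents + cross_attention(latents, spatial, W(), W(), W())

# ... standard DiT blocks would process `read_out` here (omitted) ...

# Tail-latent dropping: keep only the first k latents to cut compute;
# training with random k teaches an importance-ordered representation.
k = 4
kept = read_out[:k]

# Write: spatial tokens read the refined information back from the kept latents.
write_out = spatial + cross_attention(spatial, kept, W(), W(), W())

print(write_out.shape)  # spatial resolution is unchanged: (64, 16)
```

Because the transformer blocks act only on the latents, shrinking `k` shrinks their sequence length (and thus FLOPs) without touching the spatial token count, which is the decoupling the abstract describes.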

Top-level tags: model training · computer vision · multi-modal
Detailed tags: diffusion transformers · latent interface · compute efficiency · dynamic inference · resource allocation

One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers


1️�⃣ One-sentence summary

This paper proposes ELIT, an elastic mechanism that lets diffusion transformers dynamically adjust their compute during image generation, prioritizing important regions to cut computational cost substantially while maintaining high quality.

Source: arXiv:2603.12245