📄 Abstract - UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers

Recent image diffusion transformers achieve high-fidelity generation at their training resolutions, but struggle to generate images beyond these scales, suffering from content repetition and quality degradation. In this work, we present UltraImage, a principled framework that addresses both issues. Through frequency-wise analysis of positional embeddings, we identify that repetition arises from the periodicity of the dominant frequency, whose period aligns with the training resolution. We introduce a recursive dominant frequency correction to constrain it within a single period after extrapolation. Furthermore, we find that quality degradation stems from diluted attention and thus propose entropy-guided adaptive attention concentration, which assigns higher focus factors to sharpen local attention for fine detail and lower ones to global attention patterns to preserve structural consistency. Experiments show that UltraImage consistently outperforms prior methods on Qwen-Image and Flux (around 4K) across three generation scenarios, reducing repetition and improving visual fidelity. Moreover, UltraImage can generate images up to 6K×6K without low-resolution guidance from a training resolution of 1328p, demonstrating its extreme extrapolation capability. Project page is available at this https URL.
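The abstract states that repetition comes from the dominant positional-embedding frequency whose period matches the training resolution, and that the fix constrains extrapolated positions to a single period. The paper's exact recursive correction rule is not given here; the following is a minimal sketch under the assumptions that standard RoPE-style frequencies are used and that "correction" means rescaling the dominant and lower frequencies so their periods span the target resolution (function names, `base`, and the rescaling rule are all illustrative, not the authors' method):

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    # Standard RoPE inverse frequencies for half the head dimension.
    return base ** (-np.arange(0, dim, 2) / dim)

def correct_dominant_frequency(freqs, train_len, target_len):
    # Hypothetical sketch: identify the "dominant" frequency, i.e. the
    # one whose period is closest to the training resolution, then shrink
    # it (and all lower frequencies) so one period covers target_len
    # instead of train_len, keeping extrapolated positions in one period.
    periods = 2 * np.pi / freqs
    k = int(np.argmin(np.abs(periods - train_len)))
    corrected = freqs.copy()
    corrected[k:] *= train_len / target_len  # longer period after scaling
    return corrected
```

With a 1328-pixel training axis extrapolated to 6000 pixels, the dominant period is stretched past the target length, so no position index wraps around within the generated image.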

Top tags: computer vision, model training, model evaluation
Detailed tags: image diffusion, resolution extrapolation, positional embeddings, attention mechanism, high-resolution generation

UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers


1️⃣ One-sentence summary

This paper proposes a new method called UltraImage, which corrects the periodic dominant frequency in positional embeddings and adapts the attention mechanism, resolving the content repetition and quality degradation that existing image diffusion models suffer when generating ultra-high-resolution images, and achieving strong extrapolation: generating images up to 6K directly from a 1328p training resolution.
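The summary above also mentions adapting the attention mechanism. Per the abstract, "diluted attention" is countered by entropy-guided focus factors: concentrated local attention is sharpened for fine detail while diffuse global attention is left nearly unchanged to preserve structure. The exact formulation is not in this abstract; below is a minimal self-contained sketch (the `[lo, hi]` focus range and the linear entropy-to-focus mapping are assumptions for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def entropy_guided_attention(q, k, v, lo=1.0, hi=1.5):
    # Hypothetical sketch: compute per-query attention entropy, then map
    # low entropy (peaked, local attention) to a high focus factor that
    # sharpens the distribution, and high entropy (global attention) to a
    # factor near `lo` so its structure-preserving spread is kept.
    d = q.shape[-1]
    logits = q @ np.swapaxes(k, -2, -1) / np.sqrt(d)
    probs = softmax(logits)
    ent = -(probs * np.log(np.clip(probs, 1e-9, None))).sum(-1)
    max_ent = np.log(logits.shape[-1])          # entropy of uniform attention
    focus = hi - (hi - lo) * (ent / max_ent)    # low entropy -> high focus
    return softmax(logits * focus[..., None]) @ v
```

Multiplying logits by a focus factor greater than 1 acts as an inverse temperature, concentrating probability mass on the strongest keys, which matches the abstract's description of sharpening local attention.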

