TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders
1️⃣ One-sentence summary
This paper proposes a new method called TC-AE that optimizes the patch-to-latent compression process in vision transformers and enhances the semantic structure of image patches, effectively addressing the degradation of latent representation quality under deep image compression and substantially improving image reconstruction and generation.
We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the number of channels in the latent representation to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations. First, we study token-number scaling by adjusting the patch size in the ViT under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor that limits effective scaling. To address this issue, we decompose token-to-latent compression into two stages, reducing structural information loss and enabling effective token-number scaling for generation. Second, to further mitigate latent representation collapse, we enhance the semantic structure of image tokens via joint self-supervised training, yielding more generation-friendly latents. With these designs, TC-AE achieves substantially improved reconstruction and generative performance under deep compression. We hope our research will advance ViT-based tokenizers for visual generation.
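The first innovation studies how the token count grows as the ViT patch size shrinks while the latent budget stays fixed, which makes the token-to-latent compression increasingly aggressive. A minimal sketch of that arithmetic (all concrete sizes here are illustrative assumptions, not values from the paper):

```python
# Hedged sketch: token-count scaling under a fixed latent budget.
# Image size, patch sizes, and the latent budget below are assumptions
# chosen for illustration, not the paper's configuration.

def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of ViT tokens when a square image is split into square patches."""
    assert image_size % patch_size == 0, "patch must tile the image"
    return (image_size // patch_size) ** 2

def tokens_per_latent(image_size: int, patch_size: int, num_latents: int) -> float:
    """How many image tokens must be squeezed into each latent
    when the latent budget (num_latents) is held fixed."""
    return num_tokens(image_size, patch_size) / num_latents

# Assumed fixed latent budget: 64 latent tokens for a 256x256 image.
for p in (32, 16, 8):
    t = num_tokens(256, p)
    r = tokens_per_latent(256, p, 64)
    print(f"patch={p:2d}  tokens={t:4d}  tokens-per-latent={r:5.1f}")
```

Halving the patch size quadruples the token count, so under a fixed latent budget each latent must absorb four times as many tokens; the abstract identifies this aggressive token-to-latent compression as the bottleneck that TC-AE's two-stage decomposition is designed to relieve.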
Source: arXiv: 2604.07340