arXiv submission date: 2025-11-28
📄 Abstract - VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

Unifying representations for multimodal understanding, generation, and reconstruction in a single tokenizer remains a key challenge in building unified models. Previous research predominantly addresses this with a dual-encoder paradigm, e.g., using separate encoders for understanding and generation, or balancing semantic representations against low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, the first exploration of a unified representation that produces continuous semantic features for image understanding and discrete tokens for visual generation within a single tokenizer. Specifically, we build on pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: first, we freeze the encoder and learn a high-dimensional semantic VQ codebook with a pixel-reconstruction objective; then we jointly optimize the encoder under self-distillation constraints. This design incurs negligible loss of semantic information, preserving multimodal understanding while yielding discrete tokens suitable for generation and fine-grained reconstruction. In addition, we identify an intriguing property of quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the common practice of low-dimensional codebooks in image reconstruction. The semantic VQ codebook achieves a 100% utilization ratio at a dimension of 1536. VQRAE shows competitive performance on several benchmarks for visual understanding, generation, and reconstruction, with promising scaling behavior in the autoregressive paradigm thanks to its discrete tokens.
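To make the core mechanism concrete, below is a minimal sketch of a vector-quantization bottleneck with a high-dimensional codebook, matching the 1536-dimensional codes the abstract reports. The codebook size, the straight-through estimator, and the commitment-loss weight are standard VQ-VAE practice assumed here for illustration, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticVQ(nn.Module):
    """Sketch: VQ bottleneck over high-dimensional semantic features."""

    def __init__(self, num_codes: int = 16384, dim: int = 1536):
        super().__init__()
        # High-dimensional codebook: each code lives in the 1536-d
        # semantic feature space of the pretrained encoder.
        # num_codes is a hypothetical value, not from the paper.
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.normal_(self.codebook.weight, std=0.02)

    def forward(self, z: torch.Tensor):
        # z: (batch, tokens, dim) continuous semantic features.
        flat = z.reshape(-1, z.shape[-1])                # (B*T, dim)
        dists = torch.cdist(flat, self.codebook.weight)  # (B*T, num_codes)
        idx = dists.argmin(dim=-1)                       # discrete token ids
        z_q = self.codebook(idx).view_as(z)              # quantized features
        # Straight-through estimator: gradients bypass the argmin,
        # flowing to the encoder as if quantization were the identity.
        z_q_st = z + (z_q - z).detach()
        # Standard VQ-VAE codebook + commitment losses.
        vq_loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
        return z_q_st, idx.view(z.shape[:-1]), vq_loss

# Usage: the same module yields both the continuous path (z_q) used for
# reconstruction and the discrete ids usable by an autoregressive model.
vq = SemanticVQ()
z = torch.randn(2, 256, 1536)
z_q, ids, loss = vq(z)
```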

Top-level tags: multi-modal model training computer vision
Detailed tags: vector quantization unified representation vision transformer autoencoder semantic codebook

VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction


1️⃣ One-Sentence Summary

This paper proposes a new unified model called VQRAE that handles image understanding, generation, and fine-grained reconstruction within a single framework; its core innovation is a high-dimensional semantic codebook that unifies continuous semantic features and discrete generation tokens.
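The abstract's two-stage recipe can also be sketched as training loops: stage 1 freezes the encoder and trains the codebook plus ViT decoder on pixel reconstruction; stage 2 unfreezes the encoder and adds a self-distillation term against a frozen teacher copy. The MSE form of the distillation loss, the weight `w`, and all optimizer details below are assumptions for illustration, not specifics from the paper; `SemanticVQ` refers to the sketch above.

```python
import copy
import torch
import torch.nn.functional as F

def stage1_step(encoder, vq, decoder, images, optimizer):
    # Stage 1: encoder frozen; only the codebook and ViT decoder learn,
    # driven by pixel reconstruction plus the VQ objective.
    with torch.no_grad():
        z = encoder(images)
    z_q, _, vq_loss = vq(z)
    recon = decoder(z_q)
    loss = F.mse_loss(recon, images) + vq_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def stage2_step(encoder, teacher, vq, decoder, images, optimizer, w=1.0):
    # Stage 2: encoder unfrozen; a frozen teacher snapshot of the
    # pretrained encoder anchors its features via self-distillation
    # (loss form and weight w are assumptions, not from the paper).
    z = encoder(images)
    z_q, _, vq_loss = vq(z)
    recon = decoder(z_q)
    with torch.no_grad():
        z_t = teacher(images)
    loss = F.mse_loss(recon, images) + vq_loss + w * F.mse_loss(z, z_t)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# teacher = copy.deepcopy(encoder).eval()  # snapshot taken before stage 2
```

The design rationale suggested by the abstract: keeping the encoder near its pretrained features preserves understanding ability, while the reconstruction and VQ objectives shape discrete tokens good enough for generation and fine-grained reconstruction.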


Source: arXiv:2511.23386