OpenVision 3: A Family of Unified Visual Encoders for Both Understanding and Generation
1️⃣ One-sentence summary
This paper introduces OpenVision 3, a new vision encoder trained with a unified recipe so that a single model can both understand image content well and generate new images effectively, breaking the conventional requirement of separate models for understanding and generation tasks.
This paper presents a family of advanced vision encoders, named OpenVision 3, that learn a single, unified visual representation serving both image understanding and image generation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. First, the encoder output is passed to the ViT-VAE decoder to reconstruct the original image, encouraging the representation to capture generative structure. Second, the same representation is optimized with contrastive-learning and image-captioning objectives, strengthening semantic features. By jointly optimizing reconstruction- and semantics-driven signals in a shared latent space, the encoder learns representations that synergize and generalize well across both regimes. We validate this unified design through extensive downstream evaluations with the encoder frozen. For multimodal understanding, we plug the encoder into the LLaVA-1.5 framework: it performs comparably to a standard CLIP vision encoder (e.g., 62.4 vs 62.2 on SeedBench, and 83.7 vs 82.9 on POPE). For generation, we test it under the RAE framework: ours substantially surpasses the standard CLIP-based encoder (e.g., gFID: 1.89 vs 2.54 on ImageNet). We hope this work can spur future research on unified modeling.
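The joint training signal described above — a reconstruction loss through the decoder plus a semantics-driven contrastive loss on the same encoder output — can be sketched as a weighted sum of two terms. The snippet below is a minimal, dependency-free illustration, assuming an MSE reconstruction loss and a symmetric-style InfoNCE contrastive loss; the function names, the loss weights (`lambda_recon`, `lambda_con`), and the temperature value are illustrative assumptions, not taken from the paper.

```python
import math

def mse_reconstruction_loss(pred, target):
    """Reconstruction term (decoder path): mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na = math.sqrt(dot(a, a)) or 1.0
    nb = math.sqrt(dot(b, b)) or 1.0
    return dot(a, b) / (na * nb)

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """CLIP-style InfoNCE over matched image/text embedding pairs."""
    n = len(img_embs)
    loss = 0.0
    for i in range(n):
        logits = [cosine(img_embs[i], txt_embs[j]) / temperature
                  for j in range(n)]
        m = max(logits)  # stabilize log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # -log softmax at the positive pair
    return loss / n

def joint_loss(recon_pred, recon_target, img_embs, txt_embs,
               lambda_recon=1.0, lambda_con=1.0):
    """Joint objective: reconstruction + semantics in one shared space.
    The weights are hypothetical placeholders."""
    return (lambda_recon * mse_reconstruction_loss(recon_pred, recon_target)
            + lambda_con * contrastive_loss(img_embs, txt_embs))

# Toy check: a perfect reconstruction and perfectly aligned pairs
# should give a near-zero joint loss.
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[1.0, 0.0], [0.0, 1.0]]
print(joint_loss([0.1, 0.2], [0.1, 0.2], imgs, txts))
```

In practice both terms would be computed on batched tensors in a deep-learning framework, but the structure — one encoder output feeding two loss heads — is the same.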
Source: arXiv:2601.15369