Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation
1️⃣ One-Sentence Summary
This paper proposes a post-training method called UniMRG that has a unified multimodal model additionally learn to generate multiple intrinsic representations of an input image (pixels, depth, and segmentation maps), helping the model build a fuller, deeper understanding of visual content and thereby improving both its visual understanding and generation capabilities.
Unified Multimodal Models (UMMs) integrate visual understanding and generation within a single framework. The ultimate goal is a cycle in which understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction, using generation to improve understanding, remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective, architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks: alongside standard visual understanding objectives, UMMs are trained to generate multiple intrinsic representations of input images, namely pixels (reconstruction), depth (geometry), and segmentation (structure). By synthesizing these diverse representations, UMMs capture complementary information about appearance, spatial relations, and structural layout, and thereby develop a deeper, more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.
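The abstract describes the training recipe only at a high level; structurally, the combined objective can be read as the standard understanding loss plus weighted auxiliary generation losses over the three intrinsic representations. The sketch below illustrates that structure only. The interface (`UMM`, `understanding_loss`, `generation_loss`) and the `w_*` weights are hypothetical assumptions, not the paper's actual implementation.

```python
from typing import Protocol
import torch


class UMM(Protocol):
    """Hypothetical interface for a unified multimodal model (assumed, not from the paper)."""

    def understanding_loss(self, image: torch.Tensor, text: torch.Tensor) -> torch.Tensor: ...
    def generation_loss(self, image: torch.Tensor, target: torch.Tensor) -> torch.Tensor: ...


def unimrg_loss(
    model: UMM,
    image: torch.Tensor,
    text: torch.Tensor,
    depth_map: torch.Tensor,
    seg_map: torch.Tensor,
    w_pixel: float = 1.0,
    w_depth: float = 1.0,
    w_seg: float = 1.0,
) -> torch.Tensor:
    """Combine the standard understanding objective with auxiliary
    multi-representation generation losses (weights are illustrative)."""
    # Standard visual understanding objective (e.g., predicting the text
    # answer conditioned on the image).
    l_understand = model.understanding_loss(image, text)

    # Auxiliary generation targets: re-synthesize the input image and its
    # depth and segmentation maps.
    l_pixel = model.generation_loss(image, target=image)      # appearance
    l_depth = model.generation_loss(image, target=depth_map)  # geometry
    l_seg = model.generation_loss(image, target=seg_map)      # structure

    return l_understand + w_pixel * l_pixel + w_depth * l_depth + w_seg * l_seg
```

The point the abstract emphasizes is that the three targets are complementary: pixels carry appearance, depth carries spatial relations, and segmentation carries structural layout, so jointly generating them pushes the model toward a more complete internal picture of the input.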
Source: arXiv: 2601.21406