arXiv submission date: 2026-05-18
📄 Abstract - LatentUMM: Dual Latent Alignment for Unified Multimodal Models

Unified multimodal models (UMMs) achieve strong performance in both understanding and generation by learning a shared latent space, yet they often exhibit functional inconsistency between these two capabilities. We observe that this issue does not stem from a lack of shared representations, but from the absence of explicit alignment between the transformations that map into and out of the latent space. As a result, generation and re-encoding can follow inconsistent trajectories, leading to semantic drift under modality transitions. In this work, we propose LatentUMM, a framework that constructs an enhanced shared latent space to explicitly align these transformations and improve cross-modal consistency. LatentUMM consists of two stages. First, dual latent alignment enforces consistency at both the modality and capacity levels: cross-modal alignment uses a stronger embedding model to impose structured cross-modal semantics, while dual capacity alignment enforces bidirectional consistency under generation and re-encoding. Second, latent dynamics stabilization improves robustness via stochastic latent rollouts and preference optimization, favoring trajectories that better preserve semantic consistency. Experiments show that LatentUMM consistently improves multimodal consistency across diverse architectures. Code is available at: this https URL.
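
To make the generation and re-encoding consistency idea in the abstract concrete, here is a minimal sketch. It is not the authors' implementation. It assumes latents are plain float vectors, that semantic drift is measured as one minus cosine similarity, and that `generate` and `encode` are hypothetical stand-ins for a model's decoder and encoder. The function names, the Gaussian noise model, and the rule for picking a preference pair are all assumptions made for illustration.

```python
import math
import random
from typing import Any, Callable, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def round_trip_drift(
    z: Sequence[float],
    generate: Callable[[Sequence[float]], Any],
    encode: Callable[[Any], Sequence[float]],
) -> float:
    """Drift after latent -> output -> latent, as 1 - cos(z, encode(generate(z)))."""
    return 1.0 - cosine_similarity(z, encode(generate(z)))


def pick_preference_pair(
    z: Sequence[float],
    generate: Callable[[Sequence[float]], Any],
    encode: Callable[[Any], Sequence[float]],
    num_rollouts: int = 4,
    noise_std: float = 0.1,
    seed: int = 0,
) -> Tuple[Any, Any]:
    """Sample noisy rollouts and return (lowest-drift output, highest-drift output).

    Hypothetical selection rule: the pair could serve as the chosen/rejected
    examples for a preference-optimization step.
    """
    rng = random.Random(seed)
    scored: List[Tuple[float, Any]] = []
    for _ in range(num_rollouts):
        # Perturb the latent to obtain a stochastic rollout (assumed noise model).
        z_noisy = [x + rng.gauss(0.0, noise_std) for x in z]
        output = generate(z_noisy)
        # Score the rollout by how far its re-encoding drifts from the clean latent.
        drift = 1.0 - cosine_similarity(z, encode(output))
        scored.append((drift, output))
    scored.sort(key=lambda pair: pair[0])
    return scored[0][1], scored[-1][1]
```

With identity stand-ins for both callables, `round_trip_drift` is numerically zero, which gives a quick sanity check before plugging in a real encoder and decoder.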

Top-level tags: multi-modal, machine learning, model training
Detailed tags: unified multimodal models, latent space alignment, semantic consistency, preference optimization, cross-modal alignment

LatentUMM: Dual Latent Alignment for Unified Multimodal Models


1️⃣ One-sentence summary

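This paper proposes LatentUMM, a new framework that introduces dual latent alignment into unified multimodal models, aligning both across modalities and between the encoding and generation processes. It addresses the functional inconsistency between the models' understanding and generation tasks, and thereby significantly improves semantic consistency under cross-modal transitions.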

Source: arXiv: 2605.17766