arXiv submission date: 2026-01-06
📄 Abstract - UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision

While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles (Proposer, Solver, and Judge), UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text-to-Image-to-Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF (73.8), DPG (86.8), CompBench (88.5), and UniCycle, while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.
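The UniCycle benchmark scores how well meaning survives a Text→Image→Text round trip. The sketch below illustrates that loop under stated assumptions: `generate_image` and `describe_image` are hypothetical stand-ins for the model's T2I and I2T stages (the paper does not specify these interfaces), and token overlap is used as a toy stand-in for whatever reconstruction metric the benchmark actually uses.

```python
# Minimal sketch of a UniCycle-style Text-to-Image-to-Text consistency check.
# generate_image / describe_image are hypothetical stubs, NOT the paper's API;
# the scoring below is a toy token-overlap metric for illustration only.

def generate_image(prompt: str) -> str:
    # Hypothetical T2I stage: returns an opaque image handle.
    return f"<image rendered from: {prompt}>"

def describe_image(image: str) -> str:
    # Hypothetical I2T stage: recovers a caption from the image handle.
    return image.removeprefix("<image rendered from: ").removesuffix(">")

def cycle_consistency(prompt: str) -> float:
    """Score how much of the original prompt survives a T->I->T round trip."""
    recovered = describe_image(generate_image(prompt))
    orig = set(prompt.lower().split())
    back = set(recovered.lower().split())
    return len(orig & back) / max(len(orig), 1)  # fraction recovered, in [0, 1]

score = cycle_consistency("a red cube on a blue table")
print(score)  # with these lossless stubs the round trip scores 1.0
```

A real evaluation would replace the stubs with the UMM's own generation and captioning heads and a stronger semantic similarity measure; the point here is only the shape of the loop.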

Top-level tags: multi-modal model training aigc
Detailed tags: self-improvement multimodal generation text-to-image cycle consistency self-supervised learning

UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision


1️⃣ One-Sentence Summary

This paper proposes UniCorn, a self-improvement framework that lets a single unified multimodal AI model substantially improve its ability to generate high-quality images from text descriptions through internal role-playing and self-play, without any external data or human guidance, while preserving its ability to understand image content.
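The "internal role-playing" works by having one model act as three roles in turn: a Proposer that invents a prompt, a Solver that generates an image for it, and a Judge that scores the pair. A minimal sketch of one such round, with all three roles as hypothetical stub callables (the paper's actual prompting and filtering details are not specified here):

```python
# Sketch of one Proposer/Solver/Judge self-play round.
# In UniCorn all three roles are played by the same UMM; these stubs are
# hypothetical placeholders standing in for calls to that model.

def proposer() -> str:
    # Proposer role: invents a generation task (a text prompt).
    return "a cat wearing a tiny wizard hat"

def solver(prompt: str) -> str:
    # Solver role: produces a candidate image for the prompt.
    return f"image({prompt})"

def judge(prompt: str, image: str) -> float:
    # Judge role: rates prompt-image faithfulness (toy heuristic here).
    return 1.0 if prompt in image else 0.0

def self_play_round(threshold: float = 0.5):
    """Run one round; keep the (prompt, image) pair only if the Judge approves."""
    prompt = proposer()
    image = solver(prompt)
    return (prompt, image) if judge(prompt, image) >= threshold else None

pair = self_play_round()
print(pair is not None)  # this toy Judge accepts the pair, so True
```

Accepted pairs would then serve as the self-generated supervision used to fine-tune the model's generation head, closing the loop without external data.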

Source: arXiv 2601.03193