AR-Omni: A Unified Autoregressive Model for Any-to-Any Generation
1️⃣ One-Sentence Summary
This paper presents AR-Omni, a unified model that handles text, image, and speech generation with a single autoregressive decoder. Through new training and inference techniques, it addresses the key difficulties of unified multimodal modeling and achieves high-quality, real-time multimodal generation.
Real-world perception and interaction are inherently multimodal, encompassing not only language but also vision and speech, which motivates the development of "Omni" MLLMs that support both multimodal inputs and multimodal outputs. While a number of omni MLLMs have emerged, most existing systems still rely on additional expert components to achieve multimodal generation, limiting the simplicity of unified training and inference. Autoregressive (AR) modeling, with a single token stream, a single next-token objective, and a single decoder, is an elegant and scalable foundation in the text domain. Motivated by this, we present AR-Omni, a unified any-to-any model in the autoregressive paradigm without any expert decoders. AR-Omni supports autoregressive text and image generation, as well as streaming speech generation, all under a single Transformer decoder. We further address three practical issues in unified AR modeling: modality imbalance via task-aware loss reweighting, visual fidelity via a lightweight token-level perceptual alignment loss for image tokens, and stability-creativity trade-offs via a finite-state decoding mechanism. Empirically, AR-Omni achieves strong quality across all three modalities while remaining real-time, reaching a 0.88 real-time factor for speech generation.
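The abstract names task-aware loss reweighting as the remedy for modality imbalance under a single next-token objective, but does not spell out the mechanism. Below is a minimal sketch of how such reweighting could look over a unified mixed-modality token stream; the function name `reweighted_next_token_loss`, the per-token modality tags, and the `MODALITY_WEIGHTS` values are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of task-aware loss reweighting over a single mixed-modality
# token stream. Modality tags and weight values are hypothetical.
import torch
import torch.nn.functional as F

TEXT, IMAGE, SPEECH = 0, 1, 2
# Hypothetical weights chosen to counter modality imbalance (e.g., speech
# frames dominating the token count); the paper does not publish these values.
MODALITY_WEIGHTS = {TEXT: 1.0, IMAGE: 0.5, SPEECH: 0.25}

def reweighted_next_token_loss(logits, targets, modality_tags):
    """Next-token cross-entropy where each position is scaled by its
    modality's weight, so no single modality dominates the gradient.

    logits:        (batch, seq_len, vocab_size) decoder outputs
    targets:       (batch, seq_len) next-token ids over the unified vocab
    modality_tags: (batch, seq_len) modality id of each target token
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)

    weights = torch.empty_like(per_token)
    for mod, w in MODALITY_WEIGHTS.items():
        weights[modality_tags == mod] = w

    # Normalize by the total weight so the loss scale stays comparable
    # across batches with different modality mixes.
    return (weights * per_token).sum() / weights.sum().clamp(min=1.0)
```

In a sketch like this, each target position in the unified stream carries a modality id, so text, image, and speech tokens all flow through the same decoder and the same objective while their gradient contributions are rebalanced.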
Source: arXiv: 2601.17761