arXiv submission date: 2026-01-05
📄 Abstract - NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking image editing, interleaved content generation, and video generation. Motivated by the distinct nature of the modalities - text is strictly sequential, while images are inherently hierarchical - we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
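The speedup claimed above comes from the decoding pattern: raster-scan AR models emit one visual token per sequential step, while next-scale prediction emits all tokens of a scale in parallel, so the sequential step count equals the number of scales. A minimal sketch of this counting argument follows; the grid size and scale schedule are illustrative assumptions, not figures from the paper.

```python
# Illustrative step-count comparison: raster-scan next-token prediction
# vs. next-scale prediction. The 64x64 grid and the scale schedule below
# are assumptions for illustration, not values from the NextFlow paper.

def raster_scan_steps(grid: int) -> int:
    """Raster-scan AR decodes one token per sequential step."""
    return grid * grid

def next_scale_steps(scales: list[int]) -> int:
    """Next-scale AR decodes every token of a scale in parallel,
    so sequential steps = number of scales."""
    return len(scales)

# Assume a 1024x1024 image tokenized to a 64x64 latent grid,
# with a hypothetical coarse-to-fine schedule ending at that grid.
grid = 64
scales = [1, 2, 4, 8, 16, 32, 64]

print(raster_scan_steps(grid))   # 4096 sequential steps
print(next_scale_steps(scales))  # 7 sequential steps
```

Even under these toy numbers, the sequential-step gap (4096 vs. 7) shows why multi-scale decoding can be orders of magnitude faster than raster-scan generation at the same resolution.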

Top-level tags: multi-modal model training aigc
Detailed tags: autoregressive transformer multimodal generation next-scale prediction image generation unified modeling

NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation


1️⃣ One-sentence summary

This paper introduces NextFlow, a unified model that, via a novel "next-scale prediction" approach, can both understand and generate text, image, and video content, and generates high-resolution images far faster than traditional methods.

Source: arXiv: 2601.02204