CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation
1️⃣ One-sentence summary
This work proposes CoF-T2I, a method that repurposes the "Chain-of-Frame" reasoning ability of video generation models for text-to-image generation: the model refines image details step by step, as if performing visual reasoning, which substantially improves the quality and aesthetics of the generated images.
Recent video generation models have revealed the emergence of Chain-of-Frame (CoF) reasoning, enabling frame-by-frame visual inference. With this capability, video models have been successfully applied to various visual tasks (e.g., maze solving, visual puzzles). However, their potential to enhance text-to-image (T2I) generation remains largely unexplored, due to the absence of a clearly defined visual reasoning starting point and of interpretable intermediate states in the T2I generation process. To bridge this gap, we propose CoF-T2I, a model that integrates CoF reasoning into T2I generation via progressive visual refinement, where intermediate frames act as explicit reasoning steps and the final frame is taken as the output. To establish such an explicit generation process, we curate CoF-Evol-Instruct, a dataset of CoF trajectories that model the generation process from semantics to aesthetics. To further improve quality and avoid motion artifacts, we encode each frame independently. Experiments show that CoF-T2I significantly outperforms the base video model and achieves competitive performance on challenging benchmarks, reaching 0.86 on GenEval and 7.468 on Imagine-Bench. These results indicate the substantial promise of video models for advancing high-quality text-to-image generation.
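The inference scheme the abstract describes — intermediate frames as reasoning steps, independent per-frame decoding, final frame as the output image — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `toy_video_model` and `decode_frame` are hypothetical stand-ins for a pretrained CoF-capable video model and its frame decoder.

```python
# Toy sketch of the CoF-T2I inference loop: generate a chain of frames,
# decode each frame independently (no temporal coupling, mirroring the
# independent per-frame encoding used to avoid motion artifacts), and
# return the final frame as the output image.

def toy_video_model(prompt: str, num_frames: int) -> list[list[float]]:
    """Hypothetical stand-in for a CoF video model: each successive frame
    is a progressively refined latent (values converging toward 1.0)."""
    return [[(i + 1) / num_frames] * 4 for i in range(num_frames)]

def decode_frame(latent: list[float]) -> list[float]:
    """Hypothetical per-frame decoder, applied to each frame in isolation."""
    return [round(x, 3) for x in latent]

def cof_t2i(prompt: str, num_frames: int = 4) -> list[float]:
    frames = toy_video_model(prompt, num_frames)   # chain-of-frame reasoning steps
    decoded = [decode_frame(f) for f in frames]    # each frame decoded independently
    return decoded[-1]                             # final frame is the output image

print(cof_t2i("a red cube on a table"))  # → [1.0, 1.0, 1.0, 1.0]
```

The key design point is that earlier frames are treated as interpretable intermediate states rather than as video content to be played back; only the last frame is kept.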
Source: arXiv: 2601.10061