
arXiv submission date: 2025-12-26
📄 Abstract - Self-Evaluation Unlocks Any-Step Text-to-Image Generation

We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.
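The abstract describes two complementary training signals: a standard Flow Matching regression loss (local supervision) and a self-evaluation term in which the model scores its own few-step generations with its current velocity estimates, acting as its own teacher. The paper's exact objective and architecture are not given in this summary, so the following is a minimal illustrative sketch under stated assumptions: `VelocityNet`, the toy MLP, the Euler few-step sampler, and the particular self-evaluation surrogate loss are all hypothetical stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity field v(x, t); a hypothetical stand-in for the
    paper's text-to-image backbone (text conditioning omitted)."""
    def __init__(self, dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim)
        )

    def forward(self, x, t):
        return self.net(torch.cat([x, t[:, None]], dim=-1))

def flow_matching_loss(model, x1):
    """Standard conditional flow matching: regress the velocity (x1 - x0)
    along the straight path x_t = (1 - t) * x0 + t * x1."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0])
    xt = (1 - t[:, None]) * x0 + t[:, None] * x1
    return ((model(xt, t) - (x1 - x0)) ** 2).mean()

def self_evaluation_loss(model, batch_size=32, dim=2, steps=2, dt=0.01):
    """Illustrative guess at the self-evaluation idea: generate samples
    with a few Euler steps (gradients flow through the sampler), then let
    the model's own frozen velocity estimate near t = 1 act as a
    self-teacher that nudges the few-step output toward a point the
    current model considers closer to the data manifold."""
    x = torch.randn(batch_size, dim)
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for i in range(steps):  # few-step Euler generation
        t = ts[i].expand(batch_size)
        x = x + (ts[i + 1] - ts[i]) * model(x, t)
    with torch.no_grad():  # current model serves as a frozen self-teacher
        t1 = torch.full((batch_size,), 0.99)
        teacher_v = model(x, t1)
    # Pull the few-step sample toward the self-teacher-corrected target.
    target = (x + dt * teacher_v).detach()
    return ((x - target) ** 2).mean()

def total_loss(model, x1, lam=0.1):
    # Local flow-matching supervision plus self-driven global matching.
    return flow_matching_loss(model, x1) + lam * self_evaluation_loss(
        model, batch_size=x1.shape[0], dim=x1.shape[1]
    )
```

Because the self-evaluation term differentiates through the few-step sampler, it directly trains the short-trajectory behavior that plain Flow Matching leaves unsupervised, which is one plausible reading of how Self-E stays strong at very low step counts.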

Top-level tags: model training aigc multi-modal
Detailed tags: text-to-image flow matching self-evaluation any-step inference from-scratch training

Self-Evaluation Unlocks Any-Step Text-to-Image Generation


1️⃣ One-Sentence Summary

This paper proposes a new training method called Self-E, in which the model evaluates the quality of its own generated images during training. This yields a text-to-image model that requires no pretrained teacher, can be trained from scratch, and generates high-quality images at any number of inference steps, from a few to several dozen.

Source: arXiv: 2512.22374